Part 2:

Takens' Embedding Theorem

The Theorem

Using Takens' theorem, we can embed the time series we have available into a phase space where the data points no longer have a temporal dependence between them. This is important because statistical learning theory only guarantees the learning of an algorithm if the samples (rows, or data instances) of our dataset are independent of each other.

In order to define this embedding, we need two hyperparameters, d and m, which are the time delay and the embedding dimension respectively. The time delay defines how many data points prior to the current one we use in the embedding. The embedding dimension defines the final shape of our dataset and how many axes the phase space has, each axis representing one time delay.

In this part, we will explore Takens' embedding theorem and how one can use it to forecast time series with standard machine learning algorithms without violating the statistical learning theory assumption that the rows of our dataset must be independent of each other.

We will start by describing the theorem, applying it and comparing the result with the Giotto-TDA implementation; then we will learn how to parametrize it and how to find the embedding parameters automatically. Finally, we will discuss how to use it for forecasting.

The function below generates the embedding for a series given the m and d parameters. We will then compare it with the implementation from the Giotto-TDA library to see if we implemented it correctly.
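A minimal sketch of such an embedding function, assuming the series is a 1-D NumPy array (the name `takens_embedding` is illustrative, not a library API):

```python
import numpy as np

def takens_embedding(series, m, d):
    """Embed a 1-D series into an m-dimensional phase space with delay d.

    Row i of the result is (x[i], x[i + d], ..., x[i + (m - 1) * d]),
    so consecutive values of the series become coordinates of one point.
    """
    series = np.asarray(series)
    n_points = len(series) - (m - 1) * d
    if n_points <= 0:
        raise ValueError("series too short for the chosen m and d")
    return np.column_stack([series[i * d : i * d + n_points] for i in range(m)])

# Tiny example with m=2, d=1
print(takens_embedding([1, 2, 3, 4], m=2, d=1))
# rows: [1 2], [2 3], [3 4]
```

Each row of the output can then be treated as one independent sample in phase space.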

Embedding a Sine Function

As we can see by the phase-space plot, it seems that our implementation is correct.

Embedding the Lorenz data

Again, it seems that the implementation is correct. However, these two datasets have known embedding dimension and time delay parameters that yield the best attractor. If we are dealing with a real time series we don't know much about, we need a way of estimating these values. This is what we are going to do now.

Defining the delay

To find the time delay, we are going to use the mutual information of the time series. The idea behind this metric is that the first local minimum of the mutual information curve is a good delay for our embedding. This is an empirical strategy, and therefore it is not guaranteed to produce the best possible delay.
To compute it, we will use the NoLiTSA implementation. Let's first see how the plot behaves for the sine function.
As we can see, the first local minimum seems to be located at the time delay d = 4.
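To make the procedure concrete, here is a self-contained sketch in plain NumPy rather than NoLiTSA's internals; `mutual_information` and `first_minimum_delay` are illustrative helper names, not NoLiTSA's API:

```python
import numpy as np

def mutual_information(x, tau, bins=16):
    """Histogram estimate of the mutual information between x[t] and x[t + tau]."""
    a, b = x[:-tau], x[tau:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    nz = joint > 0  # skip empty cells so the log is well defined
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))

def first_minimum_delay(x, max_tau=50):
    """Return the first delay where the mutual information curve turns upward."""
    mi = [mutual_information(x, tau) for tau in range(1, max_tau + 1)]
    for tau in range(1, len(mi) - 1):
        if mi[tau] < mi[tau - 1] and mi[tau] < mi[tau + 1]:
            return tau + 1  # mi[k] corresponds to delay k + 1
    return int(np.argmin(mi)) + 1  # no interior minimum: fall back to the global one

t = np.linspace(0, 8 * np.pi, 400)
x = np.sin(t)
print(first_minimum_delay(x))
```

The histogram estimator is noisy on short series, so in practice the full curve is worth plotting before trusting the first minimum.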

Defining the number of dimensions

To define the number of dimensions, we will use the false nearest neighbors (FNN) method. FNN measures, for a given embedding dimension, how many points are considered neighbors but stop being neighbors once we unfold the space with one more dimension.
A common heuristic is to select the first dimension at which the FNN fraction drops below 20%. Let's see that for the sine function.

As we can see, at two dimensions the FNN fraction goes below 20%, therefore m = 2.
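For reference, a minimal sketch of the FNN heuristic in plain NumPy; the function names are illustrative, and the `rtol` threshold loosely follows Kennel's ratio criterion rather than any library default:

```python
import numpy as np

def false_nearest_neighbors(x, m, d, rtol=15.0):
    """Fraction of nearest neighbors in dimension m that become 'false'
    when the embedding is unfolded into dimension m + 1."""
    n = len(x) - m * d  # points that still exist in dimension m + 1
    emb = np.column_stack([x[i * d : i * d + n] for i in range(m)])
    false = 0
    for i in range(n):
        dist = np.linalg.norm(emb - emb[i], axis=1)
        dist[i] = np.inf  # a point is not its own neighbor
        j = int(np.argmin(dist))
        # distance gained along the new (m + 1)-th coordinate
        gain = abs(x[i + m * d] - x[j + m * d])
        if gain / max(dist[j], 1e-12) > rtol:
            false += 1
    return false / n

def estimate_dimension(x, d, max_m=10, threshold=0.2):
    """Smallest m whose FNN fraction drops below the threshold (20% here)."""
    for m in range(1, max_m + 1):
        if false_nearest_neighbors(x, m, d) < threshold:
            return m
    return max_m

t = np.linspace(0, 8 * np.pi, 400)
print(estimate_dimension(np.sin(t), d=25))
```

On a clean sine this should settle on a small dimension; on noisy real data the FNN curve often flattens instead of vanishing, which is exactly why the 20% cutoff is a heuristic.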

Now let's compare this with the Giotto-TDA implementation:
As we can see, the two implementations differ considerably. One practical approach is to pick a few candidate parameter pairs suggested by the two methods and test how well the model performs with each one.

Embedding the Airline Data

Embedding the Shampoo Data

Embedding the Lynx Data

As we can see, the attractors for the real datasets are not as well defined as the ones from the toy datasets. Because of that, we can expect lower predictive power on those datasets unless some tactic is used to make them more deterministic. We will see more about that below.

Applying the theorem for forecasting

Now, let's see how we can use this theorem to forecast some time series with a standard machine learning algorithm, Random Forest. The hope is that a good forecast in phase space translates into a good forecast in the time domain.
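A minimal sketch of that pipeline, assuming scikit-learn is available: delay vectors become the feature rows, the next value of the series is the target, and the split is chronological so the test set stays in the future (`make_supervised` is an illustrative helper, not part of any library):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_supervised(series, m, d):
    """Turn a series into (X, y): each row of X is a delay vector
    (x[i], x[i+d], ..., x[i+(m-1)d]) and y is the value that follows it."""
    n = len(series) - (m - 1) * d - 1
    X = np.column_stack([series[i * d : i * d + n] for i in range(m)])
    y = series[(m - 1) * d + 1 : (m - 1) * d + 1 + n]
    return X, y

# Toy example on a sine wave; split chronologically, never shuffle
series = np.sin(np.linspace(0, 20 * np.pi, 1000))
X, y = make_supervised(series, m=2, d=4)
split = int(0.8 * len(X))
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("test MSE:", np.mean((pred - y[split:]) ** 2))
```

To forecast more than one step ahead, the predicted value is appended to the series and a new delay vector is built from it, iterating the one-step model.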

Sine Function

Lorenz Data

Airline Data

Shampoo Data

Lynx Data

As expected, the toy datasets are easier to predict than the real ones. Only the Lynx dataset seems well behaved enough to yield a reasonably good prediction. One way to try to improve this forecast is to apply Empirical Mode Decomposition to extract the deterministic portion of the time series and then use Takens' theorem to predict it.

Nyquist–Shannon Sampling Theorem

Sampling is a process of converting a signal (for example, a function of continuous time or space) into a sequence of values (a function of discrete time or space).

Nyquist's theorem states that a signal must be sampled at more than twice its highest frequency component to be represented without loss.

Suppose the highest frequency component, in hertz, of an analog signal is fmax. According to the Nyquist theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component. The sampling in an analog-to-digital converter is driven by a pulse generator (clock). If the sampling rate is less than 2fmax, some of the highest frequency components in the analog input signal will not be correctly represented in the digitized output. When such a digital signal is converted back to analog form by a digital-to-analog converter, false frequency components appear that were not in the original analog signal. This undesirable condition is a form of distortion called aliasing.

If we sample at a rate lower than the Nyquist rate, aliasing occurs: we cannot recover the full signal, and information is lost.
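We can verify this folding numerically: a 35 Hz sine sampled at 50 Hz (below its 70 Hz Nyquist rate) produces exactly the same samples as a 15 Hz sine with flipped sign, since sin(2*pi*35*n/50) = -sin(2*pi*15*n/50). The frequencies here are just an illustrative choice:

```python
import numpy as np

fs = 50.0                            # sampling rate in Hz, below the 70 Hz Nyquist rate
n = np.arange(100)                   # sample indices
t = n / fs

high = np.sin(2 * np.pi * 35 * t)    # 35 Hz signal, undersampled
alias = -np.sin(2 * np.pi * 15 * t)  # the 15 Hz tone it folds onto

# The two sequences are sample-for-sample identical: once digitized,
# the 35 Hz tone is indistinguishable from a 15 Hz one.
print(np.allclose(high, alias))  # True
```

No amount of post-processing can separate the two after sampling, which is why the sampling rate must be chosen before digitization.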

We can see the loss at 30 Hz: the samples cannot capture a proper cycle, and the signal cannot be recovered properly below the Nyquist rate.